8 research outputs found

    Authorship attribution in Portuguese using character n-grams

    For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. For English, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches on the same corpus.
    Funding: Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN)
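
    As a rough illustration of the feature described in this abstract, the sketch below counts overlapping character n-grams and turns the counts into relative frequencies. The function names and the sample sentence are assumptions made for this example; the sketch does not distinguish the n-gram types, feature representations, or classifiers that the paper compares.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts: Counter) -> dict:
    """Normalise raw counts so that the values sum to 1 (empty input gives {})."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical usage: a character 3-gram profile of a short Portuguese phrase.
profile = relative_frequencies(char_ngrams("o gato e o rato", 3))
for gram, freq in sorted(profile.items(), key=lambda item: -item[1])[:3]:
    print(repr(gram), round(freq, 3))
```

    Profiles like this one could serve as input vectors for a classifier; the choice of n is one of the parameters the abstract says should be tuned.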

    Detección automática de primitivos semánticos con algoritmos bioinspirados [Automatic detection of semantic primitives with bio-inspired algorithms]

    Any traditional explanatory dictionary inevitably contains cycles in its definitions: if a word is defined in the dictionary and later used in another definition, there is always a path through the dictionary that returns to the same word. An example of a cycle of length three: “pacto es convenio” (a pact is an agreement), “convenio es tratado” (an agreement is a treaty), “tratado es pacto” (a treaty is a pact); in three steps we return to the same word. In a good dictionary the cycles are long, but they are unavoidable. A computational semantic dictionary (one intended for use by computers) cannot contain cycles in its definitions without impairing the logical inference capabilities of computational systems. We call semantic primitives a set of words whose removal from the dictionary would leave it free of cycles; these words have no definition in the dictionary, and in that sense they are primitive. In this thesis, our goal is to keep as many words as possible in the dictionary by obtaining a minimal number of semantic primitives. We present a method that extracts the smallest set of primitives reported so far. To do so, we represent the dictionary as a directed graph and apply bio-inspired algorithms that determine the order in which the graph should be built.
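
    The graph construction mentioned in this abstract can be sketched as follows, under stated assumptions: words are inserted one at a time in a given order, and a word is marked as a primitive (left undefined) whenever adding its definition would close a cycle. This is one plausible reading of the abstract, not the thesis's actual implementation; the function names are hypothetical and the toy dictionary reuses the pacto/convenio/tratado example.

```python
def reaches(graph, start, target):
    """Depth-first search: can target be reached from start along graph edges?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def primitives_for_order(definitions, order):
    """Insert words in the given order; return the words left without a definition.

    definitions maps each word to the words used in its definition.
    An edge word -> used_word is added only if it keeps the graph acyclic.
    """
    graph = {}
    primitives = []
    for word in order:
        used = definitions[word]
        # The new edges close a cycle exactly when some used word already leads back to word.
        if any(reaches(graph, u, word) for u in used):
            primitives.append(word)
        else:
            graph[word] = list(used)
    return primitives


toy = {"pacto": ["convenio"], "convenio": ["tratado"], "tratado": ["pacto"]}
print(primitives_for_order(toy, ["pacto", "convenio", "tratado"]))
```

    In this sketch the number of primitives depends on the insertion order, which is why the order is a natural quantity for an optimization algorithm to search over.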

    Representación computacional de la escritura maya [Computational representation of Maya writing]

    Thesis (Master of Science in Computer Science), Instituto Politécnico Nacional, CIC, 2008, 1 PDF file (88 pages). tesis.ipn.m

    Automatic detection of semantic primitives using optimization based on genetic algorithm

    In this article, we propose a method for automatically retrieving a set of semantic primitive words from an explanatory dictionary, together with a novel procedure for evaluating the obtained set of primitives. The approach represents the dictionary as a directed graph and formulates a single-objective constrained optimization problem, solved with a genetic algorithm that uses the PageRank scoring model. The problem is defined as subset selection. The algorithm searches for sets of words that fulfil several requirements: the cardinality of the set should not exceed empirically selected limits, and the PageRank word importance score is minimized with cycle-prevention thresholding. In the experiments, we used the WordNet dictionary for English. The proposed method improves on the previous state-of-the-art solutions.
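
    To make the PageRank scoring mentioned in this abstract concrete, here is a minimal power-iteration sketch over a small definition graph. The edge direction (a word points to the words used in its definition), the damping factor, and the toy graph are assumptions for illustration; the abstract does not specify how the scores are combined with the genetic algorithm, so that part is not shown.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on a dict mapping node -> list of successors."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical definition graph: "c" is used in the definitions of both "a" and "b".
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(scores, key=scores.get, reverse=True))
```

    Words that appear in many definitions accumulate a higher score, which is the kind of importance signal a subset-selection objective could draw on.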

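    Detección automática de primitivas semánticas en diccionarios explicativos con algoritmos bioinspirados [Automatic detection of semantic primitives in explanatory dictionaries with bio-inspired algorithms]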

    Inevitably, any explanatory dictionary contains cycles in its definitions: if a word is defined in the dictionary and then used in another definition, there is always a path in the dictionary that returns to the same word. In a good dictionary the cycles are long, but they are unavoidable. A computational dictionary cannot contain cycles in its definitions without impairing the logical inference capabilities of computer systems. In this study, we call semantic primitives those words whose removal from the dictionary would eliminate the cycles; these words would have no definition and, in this sense, they are primitive. Our goal is to keep as many words as possible in the dictionary, i.e., to minimize the number of semantic primitives. We present a method that achieves the smallest set of primitives obtained so far. To accomplish this, we represent the dictionary as a directed graph and apply a differential evolution algorithm that determines the order in which the graph should be built.
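
    A differential evolution search over insertion orders might look like the sketch below. It assumes a random-key encoding (each word gets a real-valued key and words are inserted in ascending key order) and the classic DE/rand/1/bin scheme; neither choice is stated in the abstract, and the function names, parameters, and toy dictionary are hypothetical.

```python
import random


def count_primitives(definitions, order):
    """Insert words in order; count those whose definition would close a cycle."""
    graph, count = {}, 0
    for word in order:
        used = definitions[word]
        stack, seen, cyclic = list(used), set(), False
        while stack:
            node = stack.pop()
            if node == word:  # a used word leads back to the word being added
                cyclic = True
                break
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        if cyclic:
            count += 1  # the word stays undefined, i.e. it is a primitive
        else:
            graph[word] = list(used)
    return count


def decode(keys, words):
    """Random-key decoding: order the words by their real-valued keys."""
    return [word for _, word in sorted(zip(keys, words))]


def evolve_order(definitions, pop_size=10, generations=30, f=0.5, cr=0.9, seed=0):
    """DE/rand/1/bin over key vectors, minimising the number of primitives."""
    rng = random.Random(seed)
    words = sorted(definitions)
    dim = len(words)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    cost = [count_primitives(definitions, decode(ind, words)) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = [
                pop[a][k] + f * (pop[b][k] - pop[c][k])
                if (rng.random() < cr or k == forced)
                else pop[i][k]
                for k in range(dim)
            ]
            trial_cost = count_primitives(definitions, decode(trial, words))
            if trial_cost <= cost[i]:  # greedy one-to-one replacement
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=cost.__getitem__)
    return decode(pop[best], words), cost[best]


# Toy dictionary with two overlapping cycles (a-b and b-c).
toy_dictionary = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
best_order, best_cost = evolve_order(toy_dictionary)
print(best_order, best_cost)
```

    In the toy dictionary, inserting "b" last leaves a single primitive, whereas inserting it first leaves two, so the search has a real preference between orders.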

    Unified, Labeled, and Semi-Structured Database of Pre-Processed Mexican Laws

    This paper presents a corpus of pre-processed Mexican laws for computational tasks. The main contributions are the proposed JSON structure and the methodology used to achieve the semi-structured corpus with the selected algorithms. Law PDF documents were transformed into plain text, unified by a deconstruction of law–document structure, and labeled with natural language processing techniques considering part of speech (PoS); a process of entity extraction was also performed. The corpus includes the Mexican constitution and the Mexican laws that were collected from the official site in PDF format repealed before 14 October 2021. The collection has 305 documents, including the Mexican constitution, 289 laws, 8 federal codes, 3 regulations, 2 statutes, 1 decree, and 1 ordinance. The semi-structured database includes the transformation of the set of laws from PDF format to a digital representation in order to facilitate its computational analysis. The documents were migrated to JSON files to represent internal hierarchical relations. In addition, basic natural language processing techniques were applied to the laws to identify parts of speech and named entities. The presented data set is mainly useful for text analysis and data science. It could be used for various legislative analysis tasks, including comprehension, interpretation, translation, classification, accessibility, coherence, and searches. Finally, we present some statistics on the identified entities and an example of the usefulness of the corpus for environmental laws.